3  Restructuring Variables

3.0.1 Age Groups

The population database contains counts for six age groups, whereas the California and LA County databases use four. Because the groupings align — “0–4,” “5–11,” and “12–17” in the population database correspond to the “0–17” group in the others — we summarized the population counts to match the four age-group format.

In the original population database, counts are reported for each single year of age within every demographic subgroup (e.g., county, sex, race/ethnicity). To obtain total estimates per age group and demographic category, the single-year counts were aggregated (summed). Table 1 below presents a subset of the original and restructured population data to illustrate the resulting summarized age groupings.

Code
step2_pop_df <- step1_pop_df %>%
  group_by(county, health_officer_region, 
           sex, race_ethnicity, age_cat) %>%
  summarise(pop = sum(pop), .groups = "drop")

step2_pop_df_recat <- step1_pop_df %>%
  mutate(new_age_group = 
          if_else(age_cat %in% 
              c("0-4", "5-11", "12-17"),
              "0-17", age_cat)) %>%
  group_by(county, health_officer_region, sex, 
           race_ethnicity, new_age_group) %>%
  summarise(pop = sum(pop), .groups = "drop") %>%
  rename("age_cat" = "new_age_group")

3.0.2 Table 1: Reconciling Age Group Categories and Population Counts

Original Age Groups:

County Sex Race / Ethnicity Age Group Population
Alameda FEMALE Hispanic 0-4 11587
Alameda FEMALE Hispanic 12-17 17098
Alameda FEMALE Hispanic 5-11 17761
Total - - - 46,446

New, Aggregate Age Group:

County Sex Race / Ethnicity Age Group Population
Alameda FEMALE Hispanic 0-17 46,446

3.0.3 Infection Dates

To support date-based analyses, infection records will be aggregated by MMWR week and year. For both the Los Angeles County and California datasets, we will generate two new columns: mmwr_year and mmwr_week, then remove the original date-based fields. We will also add columns start_date and end_date to serve as reference points should we need them later.

In the California dataset, the field time_int encodes the year and MMWR week as a six-digit integer (YYYYWW). To create the new fields, we extract the first four digits as mmwr_year and the last two digits as mmwr_week, then drop the original time_int column.

In the Los Angeles County dataset, the codebook identifies a field dt_report as the last day of the MMWR week. However, this field contained only missing values, so it was removed. Instead, we convert the infection date field, dt_dx to a proper date format, and then use the MMWRweek package to derive the mmwr_year and mmwr_week.

Code
##-- California dataset:
step2_ca_df <- step1_ca_df %>%
##--pull MMWR week and year from time_int field
mutate(
  mmwr_year = factor(time_int %/% 100), 
  mmwr_week = factor(time_int %% 100)) %>%
add_start_end_dates() %>%
select(-time_int) %>%
relocate(mmwr_year, mmwr_week, start_date, 
         end_date, .before = everything())


##-- LA county dataset:
step2_la_cnty_df <- step1_la_cnty_df %>%
##--restructure to proper date format
mutate(DATE_FIX = 
  as.Date(parse_date_time(dt_dx, "%d%b%Y"), 
          format = "%Y-%m-%d")) %>% 
##--use date to create new MMWR fields
add_mmwr_week_columns(date_col = "DATE_FIX") %>%
add_start_end_dates() %>%
select(-c(DATE_FIX, dt_dx)) %>%
relocate(mmwr_year, mmwr_week, start_date, 
         end_date, .before = everything()) %>%
relocate(county, .before = age_cat)

To streamline this process, we created two helper functions:

  • add_mmwr_week_columns() : takes date column and adds two fields: mmwr_year and mmwr_week
  • add_start_end_dates() : uses those values to generate corresponding MMWR week start and end dates

The dataframes now have a structure that looks like this:

mmwr_year mmwr_week start_date end_date county age_cat new_infections
2023 22 2023-05-28 2023-06-03 Los Angeles 0-17 15
2023 23 2023-06-04 2023-06-10 Los Angeles 0-17 17
2023 24 2023-06-11 2023-06-17 Los Angeles 0-17 23

3.1 Race and Ethnicity:

Each of the three datasets defines Race / Ethnicity differently. The California dataset uses numeric codes, the Los Angeles County dataset uses full text labels, and the population dataset uses abbreviated text.

To resolve this, we created a crosswalk file (race_ethnicity_map.csv) that aligns the three formats. By joining this crosswalk to each dataset, we ensure that all three contain a consistent set of race and ethnicity variables: each with the numeric code, the abbreviated text, and the full text label.

Code
step2_ca_df <- step2_ca_df %>%
  rename("race_coded" = "race_ethnicity") %>%
  mutate(race_coded = as.character(race_coded)) %>%
  left_join(race_ethnicity_map, by = "race_coded")%>%
  relocate(race_coded, race_short, race_long, .after = sex)

step2_la_cnty_df <- step2_la_cnty_df %>%
  mutate(race_long = clean(race_ethnicity)) %>%
  select(-race_ethnicity) %>%
  left_join(race_ethnicity_map, by = "race_long") %>%
  relocate(race_coded, race_short, race_long, .after = sex)

3.1.1 Race and Ethnicity Crosswalk Table:

race_coded race_long race_short
1 White, Non-Hispanic WhiteTE NH
2 Black, Non-Hispanic Black NH
3 American Indian or Alaska Native, Non-Hispanic AIAN NH
4 Asian, Non-Hispanic Asian NH
5 Native Hawaiian or Pacific Islander, Non-Hispanic NHPI NH
6 Multiracial (two or more of above races), Non-Hispanic MR NH
7 Hispanic (any race) Hispanic
9 Unknown Unknown

3.2 Population Counts by Demographic

To calculate infection rates by demographic groups (such as county, health officer region, sex, or race/ethnicity), we first summarize total population counts for each demographic category within the population dataframe. The population dataset is then joined to both the master database (which merges all three datasets) and the individual California and Los Angeles datasets.

By creating and maintaining these summarized population counts, we avoid having to recalculate them each time we focus on a different demographic group. For example, once this population dataset is joined to the California data, we can easily calculate population-adjusted infection rates that allow valid comparisons across counties with differing population sizes.

Code
step2_pop_df <- step2_pop_df_recat %>%
  
##-- join population database to the race/ethnicity map
  mutate(race_short = clean(race_ethnicity)) %>%
  select(-race_ethnicity) %>%
  left_join(race_ethnicity_map, by = "race_short") %>%
  relocate(race_coded, race_short, race_long, .after = sex) %>%
  
##-- calculate population totals by demographic
  group_by(county, health_officer_region) %>%
    mutate(total_cnty_pop = sum(pop)) %>% ungroup() %>%
  group_by(county, health_officer_region, race_coded, race_short, race_long) %>%
    mutate(total_race_pop = sum(pop)) %>% ungroup() %>%
  group_by(county, health_officer_region, sex) %>%
    mutate(total_sex_pop = sum(pop)) %>% ungroup() %>%
  group_by(health_officer_region) %>%
    mutate(total_HOR_pop = sum(pop)) %>% ungroup()

3.2.1 Table 2: Calculating Infection Rate by 100,000 Population Example

County Total Population Total Infected Proportion Infected Infection Rate per 100K
Alameda 1656037 12427 0.0075041 75.0
Alpine 1165 25 0.0214592 214.6
Amador 40122 767 0.0191167 191.2